Assignment Instructions:

Assignment submission (Marie Tolteca): ______________________________________


library(tidyverse)
library(here)
library(janitor)
library(estimatr)  
library(performance)
library(jtools)
library(gt)
library(gtsummary)
library(interactions) 
library(ggridges)
library(beeswarm)

Introduction

You’re about to dive into some deep data collected from five reef sites in Santa Barbara County, all about the abundance of California spiny lobsters! 🦞 Data was gathered by divers annually from 2012 to 2018 across Naples, Mohawk, Isla Vista, Carpinteria, and Arroyo Quemado reefs.

Why lobsters? Well, this sample provides an opportunity to evaluate the impact of Marine Protected Areas (MPAs) established on January 1, 2012 (Reed, 2019). Of these five reefs, Naples, and Isla Vista are MPAs, while the other three are not protected (non-MPAs). Comparing lobster health between these protected and non-protected areas gives us the chance to study how commercial and recreational fishing might impact these ecosystems.

We will consider the MPA sites the treatment group and use regression methods to explore whether protecting these reefs really makes a difference compared to non-MPA sites (our control group). In this assignment, we’ll think deeply about which causal inference assumptions hold up under the research design and identify where they fall short.

Let’s break it down step by step and see what the data reveals! 📊


Step 1: Anticipating potential sources of selection bias

a. Do the control sites (Arroyo Quemado, Carpenteria, and Mohawk) provide a strong counterfactual for our treatment sites (Naples, Isla Vista)? Write a paragraph making a case for why this comparison is ceteris paribus or whether selection bias is likely (be specific!).

The control sites may not provide a strong counterfactual for the treatment sites because assignment to Marine Protected Area status was not random. MPAs were designated through a policy process that likely considered ecological quality, habitat suitability, and existing fishing pressure. As a result, Naples and Isla Vista may have differed systematically from non-MPA reefs even before protection began, potentially having healthier ecosystems or higher baseline lobster abundance. These pre-existing differences introduce selection bias, making it difficult to attribute post 2012 differences in lobster abundance solely to MPA protection. Therefore, the ceteris paribus assumption is unlikely to hold.


Step 2: Read & wrangle data

a. Read in the raw data from the “data” folder named spiny_abundance_sb_18.csv. Name the data.frame rawdata

b. Use the function clean_names() from the janitor package

# HINT: check for coding of missing values (`na = "-99999"`)

rawdata <- read_csv(here('data','spiny_abundance_sb_18.csv'), na = '-99999') %>% 
    clean_names()

c. Create a new df named tidyata. Using the variable site (reef location) create a new variable reef as a factor and add the following labels in the order listed (i.e., re-order the levels):

"Arroyo Quemado", "Carpenteria", "Mohawk", "Isla Vista",  "Naples"
tidydata <- rawdata %>% 
    mutate(reef = case_when(
        site == "AQUE" ~ "Arroyo Quemado",
        site == "CARP" ~ "Carpenteria",
        site == "MOHK" ~ "Mohawk",
        site == "IVEE" ~ "Isla Vista",
        site == "NAPL" ~ "Naples"),
        reef = factor(reef,
                      levels = c("Arroyo Quemado","Carpenteria","Mohawk", "Isla Vista","Naples")))

Create new df named spiny_counts

spiny_counts <- tidydata

d. Create a new variable counts to allow for an analysis of lobster counts where the unit-level of observation is the total number of observed lobsters per site, year and transect.

  • Create a variable mean_size from the variable size_mm
  • NOTE: The variable counts should have values which are integers (whole numbers).
  • Make sure to account for missing cases (na)!

e. Create a new variable mpa with levels MPA and non_MPA. For our regression analysis create a numerical variable treat where MPA sites are coded 1 and non_MPA sites are coded 0

#HINT(d): Use `group_by()` & `summarize()` to provide the total number of lobsters observed at each site-year-transect row-observation. 

count <- tidydata %>% 
    group_by(year, site, transect) %>% 
    summarize(num_lobsters = sum(count,na.rm=TRUE), 
              mean_size = mean(size_mm,na.rm=TRUE), 
              .groups = "drop")

#HINT(e): Use `case_when()` to create the 3 new variable columns

spiny_counts <- count %>% 
    mutate(
        mpa = case_when(
            site %in% c('IVEE','NAPL') ~ "MPA",
            site %in% c('AQUE', 'CARP', 'MOHK') ~ "non_MPA"),
        treat = case_when(
            site %in% c('IVEE','NAPL') ~ 1,
            site %in% c('AQUE', 'CARP', 'MOHK')  ~ 0 
        ))

NOTE: This step is crucial to the analysis. Check with a friend or come to TA/instructor office hours to make sure the counts are coded correctly!


Step 3: Explore & visualize data

a. Take a look at the data! Get familiar with the data in each df format (tidydata, spiny_counts)

b. We will focus on the variables count, year, site, and treat(mpa) to model lobster abundance. Create the following 4 plots using a different method each time from the 6 options provided. Add a layer (geom) to each of the plots including informative descriptive statistics (you choose; e.g., mean, median, SD, quartiles, range). Make sure each plot dimension is clearly labeled (e.g., axes, groups).

Create plots displaying the distribution of lobster counts:

  1. grouped by reef site
  2. grouped by MPA status
  3. grouped by year

Create a plot of lobster size :

  1. You choose the grouping variable(s)! # Add annotation for all plots
spiny_counts %>% 
  mutate(site = fct_rev(site)) %>%  
  ggplot(aes(x = num_lobsters, y = site)) +
  geom_density_ridges(alpha = 0.7) +
  scale_fill_viridis_d(option = "plasma") +
  geom_vline(xintercept = mean(spiny_counts$num_lobsters), 
             linetype = "dashed", color = "red", size = 0.5) + 
  labs(
    title = "Distribution of Lobster Counts by Site",
    x = "Number of Lobsters",
    y = "Site"
  ) +
    annotate("text", x =50, y = 6,label= "mean of 25" )+
  theme_light() +
  theme(legend.position = "none")
A ridge plot displaying the distribution of lobster counts by site. Sites are ordered reverse alphabetically. A red vertical dashed line indicates the overall mean count of approximately 25 lobsters across all sites.

A ridge plot displaying the distribution of lobster counts by site. Sites are ordered reverse alphabetically. A red vertical dashed line indicates the overall mean count of approximately 25 lobsters across all sites.

# 2
spiny_counts %>% 
  ggplot(aes(x = mpa, y = num_lobsters)) +
  geom_violin(fill = 'dodgerblue', alpha = 0.6) +  
  geom_jitter(width = 0.2, alpha = 0.5, size = 1.5) +  
    # Displaying median of number of lobsters
  geom_hline(aes(yintercept = mean(num_lobsters)), 
             linetype = "dashed", color = "red", size = 1) + 
  labs(
    title = "Lobster Counts by MPA Status",
    subtitle = "The red horizontal dash line represents the Mean number of lobsters of approximately 25",
    x = "Marine Protected Area (MPA) Status",
    y = "Number of Lobsters"
  ) +
    scale_x_discrete(labels = c("MPA", "Non-MPA"))+
  theme_light()
A violin plot with jittered points showing lobster count distribution by MPA status (MPA vs non-MPA). A red dashed horizontal line indicates the overall mean lobster count across both groups.

A violin plot with jittered points showing lobster count distribution by MPA status (MPA vs non-MPA). A red dashed horizontal line indicates the overall mean lobster count across both groups.

# 3
spiny_counts %>% 
  ggplot(aes(x = as.factor(year), y = num_lobsters)) +
  ggbeeswarm::geom_beeswarm(alpha = 0.6, size = 2) +
  scale_color_viridis_d(option = "plasma") +
  geom_hline(yintercept = median(spiny_counts$num_lobsters), 
             linetype = "dashed", color = "red", size = 1) + 
  labs(
    title = "Lobster Count Distribution by Year",
    subtitle = "The red horizontal dash line represents the median number of lobsters of approximately 10",
    x = "Year",
    y = "Number of Lobsters"
  ) +
  theme_light() +
  theme(legend.position = "none")
A beeswarm plot showing the distribution of lobster counts by year. Each point represents an observation, colored by year. A red dashed horizontal line indicates the overall median count of approximately 10 lobsters.

A beeswarm plot showing the distribution of lobster counts by year. Each point represents an observation, colored by year. A red dashed horizontal line indicates the overall median count of approximately 10 lobsters.

spiny_counts %>% 
  mutate(site = fct_rev(site)) %>%  
  ggplot(aes(y = site, x = mean_size)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  ggbeeswarm::geom_beeswarm(alpha = 0.4, size = 1.5) +
  scale_fill_viridis_d(option = "viridis") +
  labs(
    y = 'Site',
    x = 'Mean Size of Lobsters (mm)',
    title = "Distribution of Average Lobster Size by Site"
  ) +
  theme_light() +
  theme(legend.position = "none")
A boxplot overlaid with beeswarm points showing the distribution of mean lobster size by site. Sites are ordered reverse alphabetically. Boxplots show median, quartiles, and range, while individual points show the distribution of observations.

A boxplot overlaid with beeswarm points showing the distribution of mean lobster size by site. Sites are ordered reverse alphabetically. Boxplots show median, quartiles, and range, while individual points show the distribution of observations.

c. Compare means of the outcome by treatment group. Using the tbl_summary() function from the package gt_summary

# USE: gt_summary::tbl_summary()
spiny_counts %>% 
  select(mpa, num_lobsters) %>% 
  tbl_summary(
    by = mpa,
    # Getting Calcs
    statistic = all_continuous() ~ "{mean} ({sd})", 
    # Better Labeling
    label = num_lobsters ~ "Number of Lobsters" 
  ) %>%
  modify_header(
    # 
    stat_1 = "**MPA**",
    stat_2 = "**Non-MPA**"
  ) %>%
  bold_labels()
Characteristic MPA1 Non-MPA1
Number of Lobsters 28 (44) 23 (39)
1 Mean (SD)

Step 4: OLS regression- building intuition

a. Start with a simple OLS estimator of lobster counts regressed on treatment. Use the function summ() from the jtools package to print the OLS output

b. Interpret the intercept & predictor coefficients in your own words. Use full sentences and write your interpretation of the regression results to be as clear as possible to a non-academic audience.

- Intercept: On average, non-mpa sites recorded about 28 lobsters per survey.

- Treat: MPA sites recorded about 3 more lobsters on average than non-MPA sites, although this difference is not statistically significant.
# NOTE: We will not evaluate/interpret model fit in this assignment (e.g., R-square)

m1_ols <- lm(num_lobsters ~ treat, 
             data = spiny_counts)

summ(m1_ols, model.fit = FALSE) 
Observations 252
Dependent variable num_lobsters
Type OLS linear regression
Est. S.E. t val. p
(Intercept) 22.73 3.57 6.36 0.00
treat 5.36 5.20 1.03 0.30
Standard errors: OLS

c. Check the model assumptions using the check_model function from the performance package

d. Explain the results of the 4 diagnostic plots. Why are we getting this result? 1. In the first plot, the dots do not fall within the line, and they should be generally curved around zero. - Overall, The reason we might be getting these results is because this test might not be adequent for the data we have. Another general linear model can be used and run check_model functions.

check_model(m1_ols,  check = "qq" )

  1. The residuals fall outside of the normal curve, although it is generally close to the normal curve.
check_model(m1_ols, check = "normality")

  1. The reference line is not flat and horizontal, it follows the shape with variance around it.
check_model(m1_ols, check = "homogeneity")

  1. The model predicted data does not resemeble the spike of the observed data, however it does represent a normal distribution..
check_model(m1_ols, check = "pp_check")


Step 5: Fitting GLMs

a. Estimate a Poisson regression model using the glm() function

b. Interpret the predictor coefficient in your own words. Use full sentences and write your interpretation of the results to be as clear as possible to a non-academic audience. - Intercept: On average, the expected number of lobsters at non-MPA sites is about 23. - Treatment: On average, MPA sites have approximately 23% higher lobster counts than non-MPA sites.

c. Explain the statistical concept of dispersion and overdispersion in the context of this model. - In a Poisson model, dispersion assumes the mean and variance of the outcome are equal. Overdispersion occurs when the variance exceeds the mean, which can happen due to unobserved heterogeneity across reef sites, differences in habitat quality, or clustering of lobsters.

d. Compare results with previous model, explain change in the significance of the treatment effect - Compared to OLS, the Poisson model is designed for count data and accounts for skewness and the fact that counts cannot be negative. As a result, the Poisson model produces smaller standard errors and shows the treatment effect as statistically significant, whereas the OLS model overestimates uncertainty and fails to detect significance.

#HINT1: Incidence Ratio Rate (IRR): Exponentiation of beta returns coefficient which is interpreted as the 'percent change' for a one unit increase in the predictor 

#HINT2: For the second glm() argument `family` use the following specification option `family = poisson(link = "log")`

m2_pois <-glm(num_lobsters ~ treat,
                  family = poisson(link = "log"),# Using log for multiplicative
                  data = spiny_counts
                  )

summ(m2_pois, model.fit = FALSE)
Observations 252
Dependent variable num_lobsters
Type Generalized linear model
Family poisson
Link log
Est. S.E. z val. p
(Intercept) 3.12 0.02 171.74 0.00
treat 0.21 0.03 8.44 0.00
Standard errors: MLE
#> exp(3.12) = 22.64
#> (exp(0.21)-1)*100 = 23.36%

e. Check the model assumptions. Explain results. In the assumption of poisson, Not all counts fit the Poisson curve, mean ≠ dispersion (variance),and high proportion of zeros. Our results from the poisson model is that both the SE and pvalue are much smaller, meaning they are significant.

f. Conduct tests for over-dispersion & zero-inflation. Explain results.

When the check_model was ran for the m2_pois the figures are more accurate, although there may be another model that may be a better fit for our count data. In the results for overdispersion test we got a high ratio of 67.03. From the high ratio, we can tell that we our data has significant overdispersion. In the results for zero-inflation we have a zero predicted zeroes. Although, there are 27 observed zeros, this means that the model is underfitting zeros.

check_model(m2_pois)

check_overdispersion(m2_pois)
## # Overdispersion test
## 
##        dispersion ratio =    67.033
##   Pearson's Chi-Squared = 16758.289
##                 p-value =   < 0.001
check_zeroinflation(m2_pois)
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 0
##             Ratio: 0.00

g. Fit a negative binomial model using the function glm.nb() from the package MASS and check model diagnostics

h. In 1-2 sentences explain rationale for fitting this GLM model. A negative binomial model was used because the lobster count data showed overdispersion, meaning the variance was larger than the mean. This model is more flexible than Poisson and produces more accurate standard errors, which in this case makes the treatment effect non-significant after accounting for extra variability.

i. Interpret the treatment estimate result in your own words. Compare with results from the previous model. Intercept: The model estimates that non-MPA sites have an expected lobster count of approximately 23 lobsters per survey. Treatment: MPA sites have, on average, about 23% higher lobster counts than non-MPA sites. This effect is statistically significant in the Poisson model.

Compared with the Poisson model, the negative binomial accounts for overdispersion, which increases uncertainty; as a result, the treatment effect is no longer statistically significant. Diagnostic checks show that the negative binomial fits the data better, with residuals and zero counts more accurately represented.

library(MASS) ## NOTE: The `select()` function is masked. Use: `dplyr::select()` ##
# NOTE: The `glm.nb()` function does not require a `family` argument

m3_nb <- spiny_counts %>% 
    dplyr::select(num_lobsters,treat) %>% 
    glm.nb(num_lobsters ~ treat,
           data  = .)

summ(m3_nb, model.fit = FALSE)
Observations 252
Dependent variable num_lobsters
Type Generalized linear model
Family Negative Binomial(0.55)
Link log
Est. S.E. z val. p
(Intercept) 3.12 0.12 26.40 0.00
treat 0.21 0.17 1.23 0.22
Standard errors: MLE
check_overdispersion(m3_nb)
## # Overdispersion test
## 
##  dispersion ratio = 1.400
##           p-value = 0.064
check_zeroinflation(m3_nb)
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 30
##             Ratio: 1.12
check_predictions(m3_nb)

check_model(m3_nb)


Step 6: Compare models

a. Use the export_summ() function from the jtools package to look at the three regression models you fit side-by-side.

c. Write a short paragraph comparing the results. Is the treatment effect robust or stable across the model specifications. The treatment effect is consistent across models, indicating fewer lobsters in non-MPA locations. However, the statistical significance of this effect is not robust to model choice. The Poisson model finds a highly significant negative effect, but diagnostic checks indicate overdispersion and excess zeros, which violate Poisson assumptions. When this extra variability is accounted for in the negative binomial model, SE increase and the treatment effect is no longer statistically significant. This suggests that while the estimated effect size is stable, the strength of the evidence for a treatment effect depends on the modeling assumptions.

export_summs(# ADD MODELS
    m1_ols, m2_pois, m3_nb,
             model.names = c("OLS","Poisson", "NB"),
             statistics = "none")
OLSPoissonNB
(Intercept)22.73 ***3.12 ***3.12 ***
(3.57)   (0.02)   (0.12)   
treat5.36    0.21 ***0.21    
(5.20)   (0.03)   (0.17)   
*** p < 0.001; ** p < 0.01; * p < 0.05.

Step 7: Building intuition - fixed effects

a. Create new df with the year variable converted to a factor

b. Run the following negative binomial model using glm.nb()

  • Add fixed effects for year (i.e., dummy coefficients)
  • Include an interaction term between variables treat & year (treat*year)

c. Take a look at the regression output. Each coefficient provides a comparison or the difference in means for a specific sub-group in the data. Informally, describe the what the model has estimated at a conceptual level (NOTE: you do not have to interpret coefficients individually)

  • This model estimates differences in lobster counts between treated and untreated sites while controlling year to year. This interaction term being year, allows for variation and impact in treatment areas over time.

d. Explain why the main effect for treatment is negative? *Does this result make sense? The main effect for treatment is negative because, in a model with interactions. The main effect represents the treatment difference in the reference (baseline) year. Early in the study, treated sites may have had lower lobster counts before protections fully took effect. This result makes sense because the positive effects of MPAs accumulate over time, which is captured by the positive treatment and year interaction coefficients in later years.

ff_counts <- spiny_counts %>% 
    mutate(year=as_factor(year))
    
m5_fixedeffs <- glm.nb(
    num_lobsters ~ 
        treat +
        year +
        treat*year,
    data = ff_counts)

summ(m5_fixedeffs, model.fit = FALSE)
Observations 252
Dependent variable num_lobsters
Type Generalized linear model
Family Negative Binomial(0.8129)
Link log
Est. S.E. z val. p
(Intercept) 2.35 0.26 8.89 0.00
treat -1.72 0.42 -4.12 0.00
year2013 -0.35 0.38 -0.93 0.35
year2014 0.08 0.37 0.21 0.84
year2015 0.86 0.37 2.32 0.02
year2016 0.90 0.37 2.43 0.01
year2017 1.56 0.37 4.25 0.00
year2018 1.04 0.37 2.81 0.00
treat:year2013 1.52 0.57 2.66 0.01
treat:year2014 2.14 0.56 3.80 0.00
treat:year2015 2.12 0.56 3.79 0.00
treat:year2016 1.40 0.56 2.50 0.01
treat:year2017 1.55 0.56 2.77 0.01
treat:year2018 2.62 0.56 4.69 0.00
Standard errors: MLE

e. Look at the model predictions: Use the interact_plot() function from package interactions to plot mean predictions by year and treatment status.

f. Re-evaluate your responses (c) and (b) above. In both plot, treated areas were lower than untreated locations. However, in both plots, it is notable that treated/ MPA locations have more abundance of lobsters compared to non-mpa locations.

p1 <- interact_plot(m5_fixedeffs, pred = year, modx = treat,
              outcome.scale = "link") + # NOTE: y-axis on log-scale
    labs(title = "Link Space",
         x = 'Year',
         y = 'Lobster Count')
p1 

# HINT: Change `outcome.scale` to "response" to convert y-axis scale to counts

p2 <- interact_plot(m5_fixedeffs, pred = year, modx = treat,
              outcome.scale = "response") + # NOTE: y-axis on log-scale
    labs(title = "Count Space",
         x = 'Year',
         y = 'Lobster Count',
         modx = 'Treatment')

p2

g. Using ggplot() create a plot in same style as the previous interaction plot, but displaying the original scale of the outcome variable (lobster counts). This type of plot is commonly used to show how the treatment effect changes across discrete time points (i.e., panel data).

The plot should have… - year on the x-axis - counts on the y-axis - mpa as the grouping variable

# Hint 1: Group counts by `year` and `mpa` and calculate the `mean_count`
# Hint 2: Convert variable `year` to a factor

plot_counts <- ff_counts %>% 
    mutate(year=as_factor(year)) %>% 
    group_by(year, mpa) %>% 
    summarize(mean_count = mean(num_lobsters, na.rm = TRUE))

# plot_counts %>% ggplot() ...
plot_counts %>% ggplot(
    aes(x= year, y = mean_count, group = mpa, color = mpa))+
    geom_line() + 
    geom_point() +
    scale_color_manual(values = 
        c("MPA" = "dodgerblue4",
        "non_MPA" = "lightblue"),
        labels = c("MPA", "Non-MPA")
        ) +
    labs(x = "Year",
         y = "Mean Count (Lobsters)",
         title = "Mean Lobster Count by Year and MPA Status",
         color  =  "MPA") +
    theme_light()


Step 8: Reconsider causal identification assumptions

  1. Discuss whether you think spillover effects are likely in this research context (see Glossary of terms; https://docs.google.com/document/d/1RIudsVcYhWGpqC-Uftk9UTz3PIq6stVyEpT44EPNgpE/edit?usp=sharing)
  • Spillover effects are likely because lobsters can move between reef sites, and MPA protection may alter both lobster movement and fishing behavior at nearby non-MPA reefs, affecting outcomes beyond treated sites.
  1. Explain why spillover is an issue for the identification of causal effects
  • Spillover is an issue because if treatment in one site affects outcomes in other sites, it violates the assumption that each unit’s outcome depends only on its own treatment. This can bias estimates of the causal effect.
  1. How does spillover relate to impact in this research setting?
  • In this study, spillover would mean that lobster abundance in non-MPA sites changes because of nearby MPAs. If present, it could make treated and untreated sites more similar, underestimating the treatment effect.
  1. Discuss the following causal inference assumptions in the context of the MPA treatment effect estimator. Evaluate if each of the assumption are reasonable:

    1. SUTVA: Stable Unit Treatment Value assumption
    • Causal inference assumption: No interference: One unit’s treatment does not affect another unit’s outcome (MPA/Treat). No hidden variation: The treatment is implemented consistently for all units (i.e., all units receive the same treatment) SUTVA may be violated if Naples and Isla Vista MPAs operate under different regulatory rules, creating multiple versions of the treatment rather than a single, uniform MPA intervention.
    1. Exogeneity assumption The exogeneity assumption may be violated if MPA designation is correlated with unobserved factors that influence lobster abundance. Fishing behavior may shift around MPA boundaries, indirectly influencing lobster counts at both protected and unprotected sites. These alternative pathways make it difficult to attribute differences in only lobster abundance to MPA status.

EXTRA CREDIT

Use the recent lobster abundance data with observations collected up until 2024 (extracredit_sblobstrs24.csv) to run an analysis evaluating the effect of MPA status on lobster counts using the same focal variables.

  1. Create a new script for the analysis on the updated data
  2. Run at least 3 regression models & assess model diagnostics
  3. Compare and contrast results with the analysis from the 2012-2018 data sample (~ 2 paragraphs)